Complete the following steps, using data on cities and towns in the US, area_data.Rds

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.9      ✓ rsample      0.1.0 
## ✓ dials        0.0.10     ✓ tune         0.1.6 
## ✓ infer        1.0.0      ✓ workflows    0.2.4 
## ✓ modeldata    0.1.1      ✓ workflowsets 0.1.0 
## ✓ parsnip      0.1.7      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.17
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(scales)
ad<-read_rds("area_data.Rds")
  1. Prep the data for analysis by creating a training dataset with 75 percent of the data and a testing dataset with 25 percent of the data.
split_data<-initial_split(ad)

ad_train<-training(split_data)

ad_test<-testing(split_data)
  1. Plot home ownership perc_homeown (dependent variable) as a function of college education college_educ and comment on what you see.
ad%>%
  ggplot(aes(x=perc_homeown))+
  geom_density()

ad%>%
  ggplot(aes(x=college_educ))+
  geom_density()

gg<-ad%>%
  ggplot(aes(y=perc_homeown,
             x=college_educ,
             text=paste(name,
                        "<br>",
                      "Homeownership:", percent(perc_homeown/100,accuracy = 1), 
                      "<br>",
                      "% College Educated:", percent(college_educ/100,accuracy=1))))+
  geom_point()

ggplotly(gg,tooltip ="text")

Modest negative relationship between percent college educated and percent owning their own homes.

  1. Run a regression of home ownership on college education (home ownership is the dependent variable, percent college educated is the independent variable).
ad_formula<-as.formula("perc_homeown~college_educ")
ad_rec<-recipe(ad_formula,ad_train)
lm_fit<-linear_reg()%>%
  set_engine("lm")%>%
  set_mode("regression")
ad_wf<-workflow()%>%
  add_recipe(ad_rec)%>%
  add_model(lm_fit)
ad_wf<-ad_wf%>%
  fit(ad_train)
ad_wf%>%tidy()
## # A tibble: 2 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    72.1      0.711     101.   0        
## 2 college_educ   -0.119    0.0277     -4.32 0.0000181
  1. Summarize the relationship between home ownership and college education in a sentence.

For a one unit change in the percent of the population with a college degree, home ownership is predicted to decline by .12 percentage points.

  1. Calculate the root mean squared error from the above model in the testing dataset and comment on what it means.
ad_lf<-ad_wf%>%last_fit(split_data)

ad_lf$.metrics
## [[1]]
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      6.44   Preprocessor1_Model1
## 2 rsq     standard      0.0284 Preprocessor1_Model1

The predictions from the model including only college education are off by about 6.9 percentage points on average.

  1. Add the variable for income to the above regression and repeat steps 3-5.
ad_formula<-as.formula("perc_homeown~college_educ+income_75")
ad_rec<-recipe(ad_formula,ad_train)
ad_wf<-workflow()%>%
  add_recipe(ad_rec)%>%
  add_model(lm_fit)
ad_wf<-ad_wf%>%
  fit(ad_train)
ad_wf%>%tidy()
## # A tibble: 3 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    66.5      1.00       66.4  4.86e-302
## 2 college_educ   -0.308    0.0363     -8.50 1.16e- 16
## 3 income_75       0.298    0.0389      7.66 6.41e- 14

There’s a positive relationship between income and home ownership. For a one unit increase in the percent of people making over 75,000, home ownership is predicted to increase by .3 percentage points, even after controlling for percent college educated.

ad_lf<-ad_wf%>%last_fit(split_data)

ad_lf$.metrics
## [[1]]
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      6.29   Preprocessor1_Model1
## 2 rsq     standard      0.0727 Preprocessor1_Model1

The accuracy of the model is slightly better, but the predictions in the testing dataset are still off by about 6.3 percentage points.

  1. Add the variable for census division (division) to your model and again repeat steps 3-5.

Levels of division

ad%>%group_by(division)%>%count()
## # A tibble: 9 × 2
## # Groups:   division [9]
##   division               n
##   <fct>              <int>
## 1 East North Central   159
## 2 West North Central   120
## 3 Mid-Atlantic          66
## 4 New England           26
## 5 East South Central    95
## 6 South Atlantic       153
## 7 West South Central   131
## 8 Mountain              94
## 9 Pacific               82
ad_formula<-as.formula("perc_homeown~college_educ+income_75+division")
ad_rec<-recipe(ad_formula,ad_train)%>%
  step_dummy(division)
ad_wf<-workflow()%>%
  add_recipe(ad_rec)%>%
  add_model(lm_fit)
ad_wf<-ad_wf%>%
  fit(ad_train)
ad_wf%>%tidy()
## # A tibble: 11 × 5
##    term                        estimate std.error statistic   p.value
##    <chr>                          <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                   69.7      1.08      64.3   5.78e-292
##  2 college_educ                  -0.333    0.0330   -10.1   1.87e- 22
##  3 income_75                      0.314    0.0374     8.39  2.89e- 16
##  4 division_West.North.Central   -1.38     0.769     -1.80  7.26e-  2
##  5 division_Mid.Atlantic         -0.421    0.951     -0.443 6.58e-  1
##  6 division_New.England           1.00     1.40       0.715 4.75e-  1
##  7 division_East.South.Central   -3.71     0.837     -4.43  1.12e-  5
##  8 division_South.Atlantic       -4.05     0.724     -5.59  3.22e-  8
##  9 division_West.South.Central   -5.43     0.775     -7.01  5.71e- 12
## 10 division_Mountain             -2.77     0.823     -3.36  8.12e-  4
## 11 division_Pacific             -10.6      0.876    -12.1   6.93e- 31

The percent of people owning their own homes is lower in almost all census divisions than in the reference category of East North Central. Homeonwership rates in Mid Atlantic and in New England are not observably different than in the East North Central Division.

ad_lf<-ad_wf%>%last_fit(split_data)

ad_lf$.metrics
## [[1]]
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       5.49  Preprocessor1_Model1
## 2 rsq     standard       0.295 Preprocessor1_Model1

Including census division results in a more accurate model. The rmse is now 5.2, indicating that the predictions in the testing dataset are off by about 5.2 percentage points.

About the Data

This data comes from the American Community Survey of 2019. It covers all of the metro or micro statistical areas in the United States. It includes characteristics of these areas, include education, income, home ownership and others as described below.

Name Description
name Name of Micro/Metro Area
college_educ Percent of population with at least a bachelor’s degree
perc_commute_30p Percent of population with commute to work of 30 minutes or more
perc_insured Percent of population with health insurance
perc_homeown Percent of housing units owned by occupier
geoid Geographic FIPS Code (id)
income_75 Percent of population with income over 75,000
perc_moved_in Percent of population that moved from another state in last year
perc_in_labor force Percent of population in labor force
metro Metropolitan Area? Yes/No
state State Abbreviation
region Census Region
division Census Division